feat(base): per-token is_content mask for body/scaffold attribution #53
Generalizes sampled_mask across all roles. is_content[k] is True iff
token k came from message-body bytes — caller-provided content /
tool_calls / reasoning_content, or the model's sampled emission for
assistant — and False iff template scaffolding (role tags, closers
when not sampled, inter-turn separators, tool-response wraps,
tools-header block, generation prompt). By construction is_content ==
sampled_mask over every assistant-attributed token; carries new
information on every other role where sampled_mask is uniformly False.
Enables SFT on tool response bodies while applying RL only to
assistant tokens — build_training_sample(..., content_sft_roles={"tool"})
trains the model to anticipate tool outputs without learning to emit
the surrounding <|tool_response>/role-tag scaffold (which would
interrupt a real rollout).
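
A minimal sketch of the mask combination described above, using plain
Python lists rather than the library's types. The function name
`toy_loss_mask` and the per-token `roles` list are assumptions made for
illustration; the real `build_training_sample` signature is not
reproduced here.

```python
def toy_loss_mask(roles, sampled_mask, is_content, content_sft_roles):
    """Combine RL supervision on sampled tokens with body-only SFT.

    roles[k] is the role of the message token k belongs to
    (a hypothetical layout chosen for this sketch).
    """
    if not (len(roles) == len(sampled_mask) == len(is_content)):
        raise ValueError("parallel lists must have equal length")
    return [
        sampled or (content and role in content_sft_roles)
        for role, sampled, content in zip(roles, sampled_mask, is_content)
    ]


# Toy sequence: one assistant turn followed by a wrapped tool response.
roles = ["assistant", "assistant", "assistant", "tool", "tool", "tool", "tool"]
sampled_mask = [False, True, True, False, False, False, False]
is_content = [False, True, True, False, True, True, False]

mask = toy_loss_mask(roles, sampled_mask, is_content, {"tool"})
rl_only = toy_loss_mask(roles, sampled_mask, is_content, set())
```

With `{"tool"}` the tool body tokens join the mask while the wrap
tokens on either side stay masked out; with an empty role set the
result equals `sampled_mask`.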
New on RenderedTokens:
- is_content: list[bool] field (empty when the renderer opts out, same
policy as sampled_mask)
- content_token_spans_by_role()
- content_mask_for_roles(roles)
New module-level helpers in base.py:
- attribute_text_segments(tokenizer, segments) — single-BPE-pass
attribution via offset_mapping; auto-loads a vanilla offset-capable
tokenizer when the supplied one doesn't track offsets (fastokens
patch), cached process-globally per model name.
- build_training_sample(..., content_sft_roles=...) — opt-in body-only
supervision for roles the model never samples. Falls back to the
prior role_to_mask + sampled_mask behaviour when is_content is
empty.
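
The offset-based attribution idea can be sketched with a toy tokenizer.
This is not the real `attribute_text_segments`: the regex tokenizer,
the function names, and the rule that a token straddling a boundary
counts as scaffold are all assumptions made for illustration.

```python
import re


def toy_tokenize_with_offsets(text):
    """Stand-in tokenizer: word runs or single other characters, with offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|\W", text)]


def toy_attribute_segments(segments):
    """Tokenize the joined text once, then label each token by source segment.

    segments: list of (text, is_content) pairs.
    Assumption: a token is content only if every character it covers
    came from a content segment.
    """
    joined = "".join(text for text, _ in segments)
    # Character-level content flags for the joined string.
    char_is_content = []
    for text, flag in segments:
        char_is_content.extend([flag] * len(text))
    tokens, is_content = [], []
    for tok, start, end in toy_tokenize_with_offsets(joined):
        tokens.append(tok)
        is_content.append(all(char_is_content[start:end]))
    return tokens, is_content


tokens, is_content = toy_attribute_segments([("user\n", False), ("hello world", True)])
merged_tokens, merged_flags = toy_attribute_segments([("ab", False), ("cd", True)])
```

The second call shows the boundary case: the toy tokenizer merges
`ab` and `cd` into one token, which this sketch labels as scaffold.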
Wired through every hand-coded renderer: qwen3, qwen3.5, qwen3.6
(inherits), qwen3-vl, glm5, glm5.1, glm4.5, kimi-k2, kimi-k2.5/2.6,
minimax-m2, deepseek-v3, nemotron-3, laguna-xs.2, gpt-oss. Concatenated
wrap+body emits go through emit_text_segments (or per-renderer
equivalents) so BPE merges at the boundary stay byte-identical with the
prior single-emit path. Renderers whose tokenizer doesn't support
offset_mapping (Kimi, MiniMax with its known fastokens edge case) use
boundary-aware emit patterns or a per-renderer overlap rule to keep
body bytes recoverable.
Multimodal placeholders (<|image_pad|>, <|media_pad|>) are body
(is_content=True) — they represent caller-provided image data in
token form. The surrounding vision/media wrap specials are scaffold.
Fixed along the way:
- nemotron3: off-by-one in message_roles when a default system was
auto-injected. Now indexes the caller-provided message list.
- kimi_k2: same off-by-one fixed via a caller_messages snapshot.
Tests: 10 invariants × 17-model matrix in tests/test_is_content.py.
Token IDs stay byte-identical vs apply_chat_template across every
renderer; existing test_render_ids / test_sampled_mask / test_bridge /
test_build_helpers / test_tokens_per_message / test_roundtrip /
test_multimodal stay green. DefaultRenderer leaves is_content empty
(Jinja is opaque), same policy as sampled_mask.
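
Two of the invariants stated in this message (the opt-out length
policy, and `is_content == sampled_mask` over assistant tokens) can be
phrased as a small checker. This is a sketch on toy lists, not code
from tests/test_is_content.py; the function name and the per-token
`roles` list are assumptions.

```python
def check_toy_invariants(token_ids, sampled_mask, is_content, roles):
    """Check two invariants on toy parallel lists; return a list of violations."""
    if not is_content:
        return []  # opt-out: an empty is_content is allowed
    if len(is_content) != len(token_ids):
        return ["is_content length differs from token_ids"]
    return [
        f"token {k}: is_content != sampled_mask on assistant"
        for k, role in enumerate(roles)
        if role == "assistant" and is_content[k] != sampled_mask[k]
    ]


token_ids = [10, 11, 12, 13]
roles = ["user", "user", "assistant", "assistant"]
sampled_mask = [False, False, False, True]

ok = check_toy_invariants(token_ids, sampled_mask, [False, True, False, True], roles)
opted_out = check_toy_invariants(token_ids, sampled_mask, [], roles)
bad = check_toy_invariants(token_ids, sampled_mask, [False, True, True, True], roles)
```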
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Output of `uv run ruff format` after wiring is_content through the renderers in the previous commit. No semantic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Approvability
Verdict: Needs human review
New feature adding per-token body/scaffold attribution (…
No code changes detected at …
please run a correctness check

@macroscope-app please run a correctness check

Correctness check is running. Results will be posted when complete.
…y_role

content_token_spans_by_role only checked that is_content matched token_ids length, but it then walks spans returned by message_token_spans(), whose values are indices into message_indices. When len(message_indices) > len(token_ids), the span end can exceed len(is_content), and the inner loop dereferences self.is_content[k] past the end, raising IndexError.

The sister method content_mask_for_roles already checks both lengths (introduced in the same commit). Match the precedent.

Not reachable through the renderer pipeline today (every hand-coded renderer populates the four parallel lists in lock-step), but the dataclass doesn't enforce the invariant, so a manually-constructed RenderedTokens (e.g. a test fixture) could trip it. The guard is cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
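
The kind of guard this commit describes can be sketched on a toy dataclass. `ToyRendered` and `content_spans_by_message` are hypothetical stand-ins, not the real `RenderedTokens` API, and returning an empty dict on a length mismatch is an assumption about the intended behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRendered:
    """Hypothetical stand-in for RenderedTokens with three parallel lists."""

    token_ids: list = field(default_factory=list)
    message_indices: list = field(default_factory=list)  # message index per token
    is_content: list = field(default_factory=list)

    def content_spans_by_message(self):
        """Contiguous is_content runs as (start, end) pairs, keyed by message index."""
        n = len(self.token_ids)
        # Guard: bail out unless every parallel list agrees on length,
        # so the loop below can never index past the end of a shorter list.
        if len(self.is_content) != n or len(self.message_indices) != n:
            return {}
        spans = {}
        start = None
        for k in range(n + 1):
            inside = k < n and self.is_content[k]
            same_msg = (
                start is not None
                and k < n
                and self.message_indices[k] == self.message_indices[start]
            )
            if start is not None and not (inside and same_msg):
                # Close the open run; `end` is exclusive.
                spans.setdefault(self.message_indices[start], []).append((start, k))
                start = None
            if inside and start is None:
                start = k
        return spans


good = ToyRendered([1, 2, 3, 4, 5], [0, 0, 0, 1, 1], [False, True, True, True, False])
mismatched = ToyRendered([1, 2], [0, 0, 0], [True, True])
```

Here `good.content_spans_by_message()` splits the run at the message boundary, while the mismatched fixture takes the guarded early return instead of raising.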
generate() already calls renderer.render() internally when the caller
doesn't pre-supply prompt_ids, producing a RenderedTokens that carries
token_ids, message_indices, sampled_mask, is_content, message_roles,
and multi_modal_data. Previously we surfaced only token_ids and
multi_modal_data and dropped the rest at the function boundary.
Callers that wanted per-token attribution downstream (verifiers'
RendererClient → prime-rl, for SFT-on-tool-body / selective loss
masking) had no way to recover it without a second render pass.
Two surfaces:
- New return field ``prompt_attribution``: the full RenderedTokens for
the prompt — either the one this call computed via render() or the
one the caller threaded in alongside prompt_ids. Downstream consumers
call e.g. ``content_mask_for_roles({"tool"})`` on it to build the
SFT-on-tool-body mask without re-rendering.
- New optional parameter ``prompt_attribution``: callers that pre-built
prompt_ids (the multi-turn bridge path in verifiers) can hand in the
RenderedTokens that bridge_to_next_turn returned, and it surfaces on
the result unchanged.
Mirrors the existing multi_modal_data plumbing — same shape, same
None-default-when-unknown semantics. No behavioural change for callers
that don't read the new field.
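
A sketch of how a downstream consumer might read the new field. The
dict-shaped `result`, the `ToyAttribution` class, and the helper name
`toy_tool_body_mask` are assumptions for illustration; only the
`prompt_attribution` name, its `None` default, and the
`content_mask_for_roles({"tool"})` call come from this change.

```python
class ToyAttribution:
    """Hypothetical stand-in exposing only content_mask_for_roles."""

    def __init__(self, roles, is_content):
        self.roles = roles  # role of the message each token belongs to
        self.is_content = is_content

    def content_mask_for_roles(self, wanted):
        return [c and r in wanted for r, c in zip(self.roles, self.is_content)]


def toy_tool_body_mask(result):
    """Return a tool-body mask from a generate()-style result, or None."""
    attribution = result.get("prompt_attribution")
    if attribution is None:
        # Pre-built prompt_ids without attribution: report the gap
        # instead of guessing a mask.
        return None
    return attribution.content_mask_for_roles({"tool"})


attribution = ToyAttribution(["user", "tool", "tool"], [True, False, True])
with_attr = toy_tool_body_mask({"prompt_attribution": attribution})
without_attr = toy_tool_body_mask({"prompt_attribution": None})
```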
Tests:
- Existing parse-and-build test now asserts prompt_attribution carries
every populated RenderedTokens field through verbatim.
- New test for the pre-built-prompt-with-attribution path (the bridge
shape verifiers will use).
- New assertion that pre-built-prompt-without-attribution returns
``None`` so callers can detect the gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@macroscope-app please run a correctness check again

Correctness check complete. No issues found. |
02c3e85 to 16c04a9
hallerite
left a comment
Lgtm overall. One concern I have is with the multimodal models & their image pad tokens – how is this handled when they are expanded vs. un-expanded?
Summary

- Adds `is_content: list[bool]` to `RenderedTokens`: a per-token signal that generalises `sampled_mask` across all roles. `True` iff the token came from message-body bytes (caller-provided `content` / `tool_calls` / `reasoning_content`, or the model's sampled emission for assistant), `False` iff template scaffolding (role tags, closers when not sampled, separators, tool-response wraps, tools-header block, generation prompt).
- Enables `build_training_sample(..., content_sft_roles={"tool"})`, so a single render produces a loss mask that combines RL on assistant tokens with SFT on tool response bodies, without supervising the surrounding `<|tool_response>` / role-tag specials that would interrupt a real rollout.
- Wired through every hand-coded renderer; token IDs stay byte-identical vs `apply_chat_template`.
- `renderers.client.generate()` returns the renderer's per-token attribution as `prompt_attribution: RenderedTokens`, so downstream consumers (verifiers `RendererClient` → prime-rl) carry the body/scaffold cut to the trainer without re-rendering.

Motivation
For RL the policy loss applies only to tokens the model emitted. A useful auxiliary objective is SFT on tool response bodies: supervise the model to anticipate what tools return, without supervising the wrap. If the model learns to emit `<|tool_response>` itself, it can derail a rollout by short-circuiting the harness.

`sampled_mask` answers "would the model emit this?", which is the right cut for assistant tokens but is uniformly `False` on non-assistant roles. There is no way to ask "which tokens came from message-body bytes" on tool / user / system messages using `sampled_mask` alone.

`is_content` is that signal. For a tool message wrapped as `<|im_start|>user\n<|tool_response>\n{body}\n<|tool_response_end|><|im_end|>\n`, `is_content` is `True` only on the `{body}` tokens, never on the `<|tool_response>` specials or the inter-section newlines.

By construction `is_content == sampled_mask` over every assistant-attributed token; on every other role `sampled_mask` is uniformly `False` and `is_content` carries information `sampled_mask` cannot. The tokens where `is_content` is `True` always include every token where `sampled_mask` is `True` (a superset, or equal), so `is_content` never contradicts `sampled_mask`.

API
On `RenderedTokens`:

- `is_content: list[bool]`: same length / empty policy as `sampled_mask`. Empty means the renderer opts out (DefaultRenderer leaves it empty for the same reason it leaves `sampled_mask` empty: Jinja is opaque).
- `content_token_spans_by_role() -> dict[str, list[tuple[int, int]]]`: contiguous body-only token runs grouped by message role.
- `content_mask_for_roles(roles) -> list[bool]`: per-token bool mask, `True` only on body tokens whose message role is in the supplied set.

Module-level in `renderers.base`:

- `attribute_text_segments(tokenizer, segments)`: single-BPE-pass attribution via `offset_mapping`. When the supplied tokenizer doesn't track offsets (fastokens patch), lazy-loads a vanilla offset-capable tokenizer for the same model and caches it process-globally.
- `build_training_sample(..., content_sft_roles=...)`: opt-in body-only supervision for roles the model never samples. Falls back to the `role_to_mask + sampled_mask` behaviour when `is_content` is empty.

On `renderers.client.generate()`:

- New return field `prompt_attribution: RenderedTokens | None`: the per-token attribution for the prompt, either the one this call computes via `render()` internally or the one the caller threaded in alongside `prompt_ids`. Downstream consumers call `attr.content_mask_for_roles({"tool"})` on it to build selective loss masks without re-rendering.
- New optional parameter `prompt_attribution: RenderedTokens | None = None`: callers that pre-build `prompt_ids` (the multi-turn bridge path in verifiers) hand in the `RenderedTokens` that `bridge_to_next_turn` returned, and it surfaces on the result unchanged.

The new field on `generate()` mirrors the existing `multi_modal_data` sidecar: same shape, same None-default-when-unknown semantics.

How it works
Every renderer has emit sites like `emit_text("user\n" + content, ...)` that join wrap text and body text into one BPE pass to preserve token merges at the boundary. The `emit_text_segments(...)` helper (defined locally in each renderer) does the same join with per-token attribution, relying on `offset_mapping` to recover each token's character span.

`fastokens` (the Rust BPE patched in by default for ~10x faster encode) doesn't track offsets. `attribute_text_segments` transparently loads a vanilla offset-capable tokenizer for the same model and caches it process-globally per model name. Most models in `MODEL_RENDERER_MAP` produce byte-identical token IDs between fastokens and vanilla, so the mix is safe; models in `FASTOKENS_INCOMPATIBLE` already use vanilla everywhere.

A few renderers use tokenizers that can't provide offset mapping at all and rely on per-renderer alternatives:

- Kimi: … `TikTokenTokenizer`. Avoids concatenated wrap+body emits to begin with: Kimi's structure splits wrap and body at special-token boundaries, so threading `is_content` through the split emits suffices.
- MiniMax: … `<response>` and the body's first letter under certain tokenizer load orders. A local `emit_token_overlap_body` helper picks the overlap rule so the body's leading byte stays recoverable from its body run.

Per-renderer coverage
- `qwen3`
- `qwen3.5`: `enable_thinking` polarity preserved.
- `qwen3.6`: inherits `Qwen35Renderer`; only overrides a pure string serializer, so it picks up `is_content` through the parent class.
- `qwen3-vl`: `<|image_pad|>` placeholders are body (`is_content=True`); the surrounding `<|vision_start|>` / `<|vision_end|>` are scaffold.
- `glm5` / `glm5.1`: `GLM5Renderer` covers both via subclass. Also covers `zai-org/GLM-4.7-Flash`.
- `glm4.5`: `<|observation|>` / `<tool_response>` wraps are scaffold; body is content.
- `kimi-k2`: … `TikTokenTokenizer`; uses existing split-emit boundaries (no `attribute_text_segments`).
- `kimi-k2.5` / `kimi-k2.6`: `<|media_pad|>` is body; `<|media_begin|>...<|media_end|>` wrap is scaffold.
- `minimax-m2`: `FASTOKENS_INCOMPATIBLE` (vanilla everywhere). Local overlap helper for `<response>` BPE merge.
- `deepseek-v3`: `FASTOKENS_INCOMPATIBLE` (Metaspace pretokenizer). Standard wrap/body split.
- `nemotron-3`: `emit_text_segments` for `\n` boundaries.
- `laguna-xs.2`
- `gpt-oss`: `functions.{name}` text on tool result messages is scaffold (comes from prior assistant `tool_calls`, not this tool's content).

`DefaultRenderer` leaves `is_content` empty.

Tests
`tests/test_is_content.py`: 10 invariants × 17-model matrix:

- Length matches `token_ids` or is empty (opt-out).
- `is_content == sampled_mask` over assistant tokens.
- … `is_content=False`.
- … `is_content=True` run.
- … `is_content=False`.
- `content_token_spans_by_role()` isolates tool body cleanly.
- `content_mask_for_roles({"tool"})` excludes assistant.
- `build_training_sample(..., content_sft_roles={"tool"})` trains tool body + assistant, never user.

`tests/test_client.py` covers the `prompt_attribution` surface on `generate()`:

- Existing parse-and-build test now asserts `prompt_attribution` carries every populated `RenderedTokens` field through verbatim.
- Pre-built-prompt path (caller supplies `prompt_ids` and `prompt_attribution`) passes attribution through unchanged.
- Pre-built prompt without attribution returns `None` so callers can detect the gap.

Full suite collects 1557 tests, and all pass (modulo pre-existing gpt-oss HF-parity skips and one unrelated xfail). `test_render_ids` byte-identity vs `apply_chat_template` is green on every renderer.

Additional fixes
- `nemotron3`: `message_roles` was sourced from the auto-injected normalised list, off-by-one when a default system was prepended. Now indexes the caller-provided message list.
- `kimi_k2`: same off-by-one fixed via a `caller_messages` snapshot.

Notes for the maintainer
- `bridge_to_next_turn` populates `is_content` on the bridge-emitted portion only; the prior portion (`previous_prompt_ids + previous_completion_ids`) gets `[False] * len(previous_ids)`, per the same convention `sampled_mask` follows on bridge output. Consumers walk the trajectory and read each step's own `is_content` for full-conversation body masks.

Note
Add per-token `is_content` body/scaffold attribution mask to all renderers

- Adds `is_content: list[bool]` field to `RenderedTokens` in renderers/base.py that marks each token as caller/model body (`True`) or template scaffolding (`False`).
- Adds `attribute_text_segments` in renderers/base.py to tokenize concatenated `(text, is_content)` segments in a single BPE pass using offset mapping, preserving merge boundaries while attributing each token to the correct segment.
- Adds `is_content` population across all renderers (`qwen3`, `qwen35`, `qwen3_vl`, `deepseek_v3`, `gpt_oss`, `kimi_k2`, `kimi_k25`, `laguna_xs2`, `minimax_m2`, `nemotron3`, `glm45`, `glm5`), including `render`, `bridge`/`bridge_to_next_turn`, and assistant/tool helpers.
- Extends `build_training_sample` with a `content_sft_roles` parameter that restricts loss to body-only tokens for specified roles using `is_content`, leaving behavior unchanged when the field is absent or empty.
- Adds `content_token_spans_by_role` and `content_mask_for_roles` helpers to `RenderedTokens` for downstream span extraction.
- … `is_content == sampled_mask`; `message_roles` in some renderers now reflects the original caller message list rather than the post-normalized list.

Macroscope summarized 16c04a9.
Note
Medium Risk
Touches core token attribution and training mask construction across many model-specific renderers; mistakes could silently change loss masking or prompt formatting. Mitigated by keeping token IDs byte-identical and by adding plumbing/tests, but the breadth of renderer changes raises regression risk.
Overview
Adds a new per-token `RenderedTokens.is_content` signal to distinguish message body bytes from renderer-injected scaffold across all roles, plus helpers to extract body-only spans/masks.

Extends `build_training_sample` with `content_sft_roles` to optionally supervise body-only tokens for non-sampled roles (e.g. tool responses) while keeping scaffold tokens masked. Introduces `attribute_text_segments` with an offset-aware tokenizer cache to attribute tokens back to `(text, is_content)` segments without breaking BPE merges.

Plumbs attribution through `bridge_to_next_turn` and `client.generate()` via a new `prompt_attribution` parameter/return field, and updates all hand-coded renderers to populate `is_content` (including special handling for opaque/prefixed formats like `gpt_oss` and boundary-merge edge cases like `minimax_m2`).

Reviewed by Cursor Bugbot for commit 16c04a9.